Method for producing reference segments describing voice modules and method for modeling voice units of a spoken test model

ABSTRACT

A method models voice units and produces reference segments for modeling voice units. The reference segments describe voice modules by characteristic vectors, the characteristic vectors being stored in the order in which they are found in a training voice signal. Alternative characteristic vectors are associated with each characteristic vector. The reference segments for describing the voice modules are combined during the modeling of larger voice units. In the event of identification, the respectively best adapted characteristic vector alternatives are used to determine the distance between a test utterance and the larger voice units.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is based on and hereby claims priority to PCT Application No. PCT/DE02/03717 filed on Oct. 1, 2002 and German Application No. 101 50 144.7 filed on Oct. 11, 2001, the contents of which are hereby incorporated by reference.

BACKGROUND OF THE INVENTION

The invention relates to a method for producing reference segments describing speech modules and a method for modeling speech units of a spoken test model in voice recognition systems.

Earlier commonly encountered speech recognition systems are based on the dynamic time warping (DTW) principle. In this situation, for each word a complete sequence of characteristic vectors, obtained from a training utterance for this word, is saved as a reference model and compared in the operational phase with a test model of a voice signal to be recognized by non-linear mapping. The comparison serves to determine the minimum distance between the respective reference model and the test model, and the reference model having the smallest distance from the test model is selected as the reference model that suitably describes the test model.
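By way of illustration only, the whole-word DTW comparison described above can be sketched in Python as follows. The function names `euclidean`, `dtw_distance` and `recognize`, the Euclidean local distance and the three-predecessor recursion are assumptions made for this sketch; the text does not prescribe them.

```python
import math

def euclidean(a, b):
    """Local distance between two characteristic vectors (assumed Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(reference, test):
    """Accumulated distance of the best non-linear alignment of two vector sequences."""
    n, m = len(reference), len(test)
    inf = float("inf")
    # acc[i][j]: cheapest accumulated distance aligning reference[:i] with test[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(reference[i - 1], test[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def recognize(reference_models, test):
    """Return the word whose whole-word reference model is closest to the test model."""
    return min(reference_models, key=lambda w: dtw_distance(reference_models[w], test))
```

In this sketch `reference_models` is a hypothetical dictionary mapping each word to its stored sequence of characteristic vectors, which makes visible why one complete model per word is required.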

The disadvantage with this method is the fact that a reference model needs to be stored for every word to be recognized, as a result of which the code book containing the reference models is extremely extensive and the effort involved in training a voice recognition system of such a type whereby a reference model is saved for every word is correspondingly great. In this situation, it is not possible to generate reference models for words differing from the learned language vocabulary. According to the present publication, the characteristics representing the reference models, which are obtained by the auto-correlation function in each case for successive analysis windows at a distance of 10 ms for example, and are subsequently referred to as auto-correlation characteristics, and the spectral characteristics are explained.

The auto-correlation characteristics describe the voice signal contained in the analysis window in the time range and the spectral characteristics, which are obtained by a Fourier transformation, describe the voice signals in the frequency range. In addition, several different distance measurements for determining a distance between two characteristic vectors are explained. In order to improve speaker-independent recognition, with regard to this known method a plurality of reference models is produced for each word, whereby the reference models are in turn ascertained as an averaged value from a plurality of training signals. In this situation, both the time structure of the entire reference model and also the characteristic structure can be ascertained as averaged values. In order to produce groups of reference models which are assigned to a word in each case and exhibit an averaged time structure, training models are mapped in non-linear fashion to an averaged model assigned to this word or word class and then a clustering of the characteristic vectors for the training model and of the reference models already present in the class is carried out separately for each analysis window.

By using this special method it is possible to achieve an extremely good recognition rate, but it remains subject to the disadvantages of the DTW method already described above.

More recent voice recognition systems are based on the HMM method (hidden Markov modeling). In this situation, in the training phase voice segments (for example phonemes or syllables) are collected from a large number of voice signals from different words and are subdivided into nodes (for example one node each per word-initial/word-internal/word-final sound). The characteristic vectors describing the voice signals are assigned to the node and stored in a code book.

With regard to speech recognition, the test model is mapped by a non-linear mapping process (for example with the aid of the Viterbi algorithm) onto a sequence of nodes defined by the transcription (for example a phonetic description) of the word. Since the nodes only describe word segments, reference models for practically any desired word of a language can be produced by concatenation of the nodes or segments. Since as a rule there are distinctly fewer phonemes or syllables than words in a language, the number of nodes is significantly less than the number of reference models describing complete words to be stored with regard to the DTW method. As a result, the training effort required for the voice recognition system is significantly reduced when compared with the DTW method.

A disadvantage with this method is however the fact that the timing sequence of characteristic vectors can no longer be ascertained within a node. This is a problem particularly in the case of long segments, such as an extended German “a” for example, in which instances a very large number of characteristic vectors of similar nodes frequently fit although the timing sequence of the vectors does not match. As a result, the recognition rate can be seriously impaired.

In Aibar P. et al.: “Multiple template modeling of sublexical units”, in: “Speech Recognition and Understanding”, pp. 519 to 524, Springer Verlag, Berlin, 1992 and also in Castro M. J. et al.: “Automatic selection of sublexic templates by using dynamic time warping techniques”, in: “Proceedings of the European Signal Processing Conference”, Vol. 5, No. 2, pp. 1351 to 1354, Barcelona, 1990, a segmentation of a training voice signal into speech modules and an analysis for obtaining a characteristic vector are described. In this situation, averaging is also carried out. In Ney H.: “The use of a one-stage dynamic programming algorithm for connected word recognition”, in: “IEEE Transactions on Acoustics, Speech, and Signal Processing”, pp. 263 to 271, Vol. ASSP-32, No. 2, 1984, the recognition of words in a continuously uttered sentence is disclosed, whereby a reference template is used for each word.

SUMMARY OF THE INVENTION

One possible object of the invention is therefore to set down a method for producing reference segments describing speech modules and a method for modeling speech units which enable high recognition rates to be achieved in a voice recognition system with a low training effort requirement.

The method uses or produces reference segments describing speech modules, which contain time-structured characteristics.

These reference segments are produced in a training phase which can take place as described in the following, for example:

-   selection of suitable word subunits (phonemes, diphones, syllables, . . . ) as speech modules,
-   determination of an average time structure for the sequence of the characteristic vectors for the selected speech modules from a large number of examples of speech,
-   selection and assignment of characteristic vectors for each of the time windows of the typical time structure,
-   storage of the models determined by this means for each speech module, which represent the reference segments and are formed from characteristic vectors which are arranged in a code book in accordance with the time structure.

The recognition phase can take place as described in the following, for example:

-   combination of the reference segments to form a reference model for a speech unit, such as a word to be recognized for example (in accordance with the phonetic description of this word, for example),
-   execution of a non-linear comparison of the test model to be recognized with the reference models and determination in each case of an overall distance between the reference models and the test model, whereby the minimum distance between the characteristic vector of the test model and the typical characteristic vectors of the speech modules assigned by way of the non-linear comparison is used for each time window,
-   selection of the reference model having the smallest distance from the test model.

The method uses a code book containing reference segments describing speech modules and having time-structured characteristics, in other words such that the characteristics are stored in a specific sequence as reference segments.

Particularly important advantages result if the individual speech modules of a reference model, such as a word for example, are phonemes, diphones, triphones or syllables. That is to say, it is possible to combine the advantages of DTW and HMM systems by on the one hand retaining the time structure but on the other hand also being able to generate reference models for new words from existing syllables.

These speech modules are described by reference segments having a typical time structure, whereby one or more characteristic vectors can be provided for each time slot of a speech module. These strings of characteristics or characteristic vectors with the respective alternative characteristics per time window describe the speech modules as they have typically occurred in the training models. As a result of combining a plurality of reference segments to form a reference model, a reference model is obtained whose speech modules contain the time structure ascertained during the training of the voice recognition system, as a result of which the reference model formed in this manner has precisely the same fine time structure as is the situation in the case of the known DTW method.

However, since the reference segments only describe individual speech modules in each case, in the training phase only these reference segments describing the speech modules need to be produced, and their number is significantly less than the number of reference models according to the DTW method.

When compared with known voice recognition systems based on the HMM method, a significantly finer time structure for the reference models is obtained when using the method since the characteristic vectors assigned to a node of the HMM method are stored without time information and as a result the HMM method exhibits no time structuring whatsoever within a node. This difference results in a substantial increase in the recognition rate achieved by the method compared with the HMM method.

A further advantage compared to the known HMM method is the fact that it is not necessary to produce special reference segments which take account of the context (in other words the adjacent segments), since the greater variance in the transition areas between adjacent speech modules can be represented by additional characteristic alternatives. In addition, temporally long speech modules are subdivided into a plurality of time windows as short speech modules, whereby the description of the individual speech modules is effected with the same quality in the case of both short and long speech modules. With regard to the known HMM method on the other hand, the speech modules are represented by a particular, arbitrarily defined, number of nodes which is independent of the length of the speech modules.

The method for producing reference segments for modeling speech units comprises the following steps:

-   segmentation of the training voice signal into speech modules in accordance with a predefined transcription,
-   analysis of the training signal in predetermined time slots with particular time windows in order to obtain at least one characteristic vector for each time window, as a result of which training models are formed which in each case contain characteristic vectors in the time sequence of the training voice signal,
-   determination of an average time structure for each speech module with a string of time windows,
-   assignment by a temporally non-linear mapping of the characteristic vectors to the time windows of the speech modules and storage of the characteristic vectors assigned in each case to a speech module in the sequence predefined by the time windows as a reference segment.

When using this method, reference segments are produced containing characteristic vectors which are present in the time sequence of the training voice signal and which in each case can be assigned to a speech module. By this means, the time structure of the training voice signal is mapped onto the reference segments, and since the reference segments can in each case be assigned to one speech module it is possible to concatenate a reference model corresponding to a word from the reference segments.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects and advantages of the present invention will become more apparent and more readily appreciated from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 shows in schematic form the method for producing reference segments (in the training phase),

FIG. 2 shows in schematic form clustered characteristic vectors for the speech module “a” which have been obtained from examples of speech having a different context,

FIG. 3 shows the method for modeling speech units in the form of a flowchart,

FIG. 4 shows in schematic form a mapping matrix with the reference models, formed from reference segments, and the test model,

FIG. 5 shows voice recognition in an HMM voice recognition system,

FIG. 6 shows the determination of the average time sequence for the phoneme structure “ts”,

FIG. 7 shows the determination of the average time sequence for the phoneme structure “ai”,

FIG. 8 shows methods used during averaging for the purposes of non-linear mapping and also the projection onto a resulting model,

FIG. 9 shows the search during recognition with states in accordance with an HMM voice recognition system according to the related art, and

FIG. 10 shows the search during recognition with reference models according to the proposed method.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.

The method for modeling speech modules in voice recognition systems uses a code book in which reference segments are stored, which each describe a speech module having time-structured characteristics. In the present embodiment, in each case the speech modules represent phonemes or diphthongs and the time-structured characteristics are spectral characteristic vectors which are stored in the reference segments in the order which corresponds to a typical voice signal of the respective speech module.

The reference segments are subdivided into time windows, whereby a plurality of characteristic vectors can be assigned as alternatives to each time window. Short phonemes, such as a “t” for example, may exhibit only a single time window. As a rule, however, a plurality of time windows is provided, the number of which results from the duration of the individual phonemes or diphthongs during production of the reference segments divided by the duration of the time windows.

The number of characteristic vectors per time window can vary. In the present embodiment, the maximum number of characteristic vectors per time window is limited to three. It can also be expedient to limit the maximum number of characteristic vectors to considerably greater values, such as 10 to 15 for example, or to provide no corresponding limit.
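The layout of a reference segment described above can be pictured with the following minimal Python sketch. The class name `ReferenceSegment`, the helper `window_count`, the rounding rule and the simple truncation to the maximum number of alternatives are all assumptions made for illustration; in the method itself the alternatives result from training rather than from truncation.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]

@dataclass
class ReferenceSegment:
    """One speech module: an ordered list of time windows, each holding alternative vectors."""
    module: str
    windows: List[List[Vector]] = field(default_factory=list)

    def add_window(self, alternatives, max_alternatives=3):
        # Keep at most max_alternatives vectors per window (three in the embodiment).
        # Truncation is a placeholder assumption, not the method's selection rule.
        self.windows.append([list(v) for v in alternatives[:max_alternatives]])

def window_count(duration_ms, window_ms=10.0):
    """Module duration divided by the window duration, with at least one window."""
    return max(1, round(duration_ms / window_ms))
```

For example, under these assumptions a module lasting 140 ms with 10 ms windows would be given 14 time windows, while a very short module would still receive one.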

The basic principle relates to the fact that for the recognition phase the time-structured reference segments are combined to form a reference model and the reference model is compared with a test model which is derived from a voice signal which is spoken and to be recognized.

FIG. 1 shows the functional sequence of a method for producing reference segments for modeling speech units. The method begins with the collection of training utterances with multiple occurrences of all required speech modules (S10) together with the corresponding transcription. This data is placed in a data bank D1 comprising the spoken voice signals and the corresponding phonetic labels. The contents of this data bank D1 are separated according to a predefined transcription into data records D2 which in each case are assigned to an individual phoneme or diphthong or other suitable speech modules (S11).

The voice data stored in the data records D2 is analyzed in fixed time slots (S12), in other words the training voice signals are subdivided into the time windows t, as a result of which training models are obtained which in each case contain characteristic vectors in the time sequence of the training voice signal and in each case can be assigned to a speech module. These training models are stored in the data records D3.
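The analysis in fixed time slots (S12) can be sketched as follows. The function names, the non-overlapping windows, the 10 ms default and the use of a few autocorrelation coefficients as the characteristic vector are assumptions chosen for this illustration; spectral characteristics, as used in the embodiment, would take the place of `autocorrelation_vector`.

```python
def frames(samples, sample_rate, window_ms=10):
    """Split a sampled signal into consecutive, non-overlapping time windows."""
    size = int(sample_rate * window_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def autocorrelation_vector(frame, order=3):
    """Characteristic vector of one window: autocorrelation at lags 0..order."""
    return [
        sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
        for k in range(order + 1)
    ]

def training_model(samples, sample_rate, window_ms=10, order=3):
    """Characteristic vectors in the time sequence of the training voice signal."""
    return [autocorrelation_vector(f, order) for f in frames(samples, sample_rate, window_ms)]
```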

An average training model having an averaged time structure and an averaged characteristic structure is determined for each speech module (S13). In this situation, an average time curve and also an average form of the characteristics are ascertained for the available models of a speech module. This can take place for example by way of non-linear mapping, as is described for example in chapters 4.2 and 4.3 of “Sprecherunabhängigkeit und Sprechadaption”, Bernhard R. Kämmerer, Informatik Fachberichte 244, Springer Verlag, 1990. In this situation, an average time structure is initially obtained from the individual time structures of the training models by way of non-linear mapping and the characteristic vectors assigned in this situation are averaged. These average models are stored in data records D4.

As a result of a representative average time structure, in other words an average form of change duration and time sequence of characteristics of the speech modules, the possibility of non-linear time adjustment in order to compensate for minor distortions is maintained.

From the data stored in data records D3 and D4, the characteristic vectors of all models are clustered for each time window of the average model of a particular speech module (S14). Known approaches such as k-means, leader algorithm or similar can be used as the clustering method. This results in the production for each speech module of a model which is represented by one or more characteristic vectors per time window. These models form the reference segments which are stored in further data records D5. The data records D5 form the result of the method and are the basis for the subsequent recognition.

As a result of combining the averaging of time structures with a clustering of the assigned characteristics it is possible to avoid the aforementioned disadvantages of HMM voice recognition facilities.

FIG. 2 shows in schematic form the clustered characteristic vectors for the speech module “a” which have been obtained from examples of speech having a different context. In this situation, no fixed number of alternative characteristic vectors is provided per time window; instead it is left to the clustering method to define the number in accordance with the variance. In the mid zone fewer characteristic vectors result than in the edge zone since the characteristic vectors of different examples of speech are very similar to one another in the mid zone but differ greatly in the edge zone as a result of the different contexts. Accordingly, more characteristic vectors which represent these differences are produced at the edge zones as a result of the clustering method.
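The per-window clustering (S14) and its variance-dependent number of alternatives can be illustrated with a minimal leader-algorithm sketch. The function names, the Euclidean distance and the `threshold` parameter are assumptions for this sketch; the leader algorithm is only one of the clustering methods named in the text, and each cluster is represented here simply by its first vector.

```python
import math

def leader_cluster(vectors, threshold):
    """Leader algorithm: a vector founds a new cluster if it is far from all leaders."""
    leaders = []
    for v in vectors:
        if all(math.dist(v, leader) > threshold for leader in leaders):
            leaders.append(list(v))
    return leaders

def build_reference_segment(vectors_per_window, threshold):
    """One list of alternative characteristic vectors per time window."""
    return [leader_cluster(vectors, threshold) for vectors in vectors_per_window]

# Toy data: contexts differ strongly at the edges, hardly at all in the middle.
edge = [[0.0], [2.0], [4.0]]
mid = [[1.0], [1.1], [0.9]]
segment = build_reference_segment([edge, mid, edge], threshold=0.5)
```

With this toy data the edge windows each keep three alternatives while the middle window keeps a single one, mirroring the behavior described for FIG. 2.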

Since the context-dependent differences at the edge zones of the speech modules can be represented by alternative characteristic vectors, it is not necessary to form complete reference segments for speech modules in different contexts, as is the situation in the case of known HMM methods, as a result of which the number of speech modules involved can be kept considerably lower.

FIG. 3 shows in schematic form the method for modeling speech units of a spoken test model in voice recognition systems in the form of a flowchart. FIG. 4 shows a mapping matrix in the form of a coordinate system for the situation where a test utterance is recognized.

This method uses a code book in which the time-structured reference segments are stored. In the present embodiment, in each case the speech modules represent phonemes or diphthongs and the time-structured characteristics are spectral characteristic vectors which are stored in the reference segments in the order which corresponds to a typical voice signal of the respective speech module. The method also uses a data bank in which speech units are stored in their phonetic description. In the present embodiment, the speech units are words and the phonetic description of the German words “eins” and “zwei” shown in FIG. 4 is as follows:

“ai” “n” “s” and “t” “s” “w” “ai”

The method begins with step S1.

In step S2, a voice signal to be recognized by a voice recognition system is converted into a test model, whereby the voice signal is changed into corresponding spectral characteristic vectors. These characteristic vectors for the test model are shown schematically along the x-axis.

In step S3, the reference segments are concatenated in accordance with the phonetic description of the words stored in the data bank to form a reference model in each case.

Since in the reference segments the characteristic vectors are assigned to specific time windows t and their sequence is defined, the reference models form a string of time-ordered and thus time-structured characteristic vectors extending over a plurality of phonemes and diphthongs.
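The concatenation of step S3 can be sketched as follows. The function name `build_reference_model`, the dictionary layout and all numeric values are hypothetical and serve only to show how the phonetic descriptions of “eins” and “zwei” select and order the reference segments.

```python
def build_reference_model(phonetic_description, code_book):
    """Concatenate the reference segments of a word into one time-ordered reference model."""
    model = []
    for module in phonetic_description:
        model.extend(code_book[module])  # windows keep their trained order
    return model

# Toy code book: module -> list of time windows, each a list of alternative vectors.
code_book = {
    "ai": [[[0.1]], [[0.2], [0.25]], [[0.3]]],
    "n": [[[0.5]], [[0.6]]],
    "s": [[[0.9]]],
    "t": [[[0.0]]],
    "w": [[[0.4]], [[0.45]]],
}
lexicon = {"eins": ["ai", "n", "s"], "zwei": ["t", "s", "w", "ai"]}
reference_models = {
    word: build_reference_model(description, code_book)
    for word, description in lexicon.items()
}
```

A new word only requires a new entry in the hypothetical `lexicon`; no additional training data is needed as long as its speech modules are already in the code book.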

Beside the y-axis, the individual phonemes and diphthongs for the German words “eins” and “zwei”, namely “ai”, “n”, “s” and “t” . . . “ai” are shown schematically together with the corresponding reference segments. These reference segments RS are subdivided in accordance with the time windows t of the analysis phase into 1 to 4 time windows t of a predetermined duration. In this situation, each reference segment exhibits three characteristic vectors MV in each case for each time window.

In step S4, the characteristic vectors of the test model are mapped using non-linear time mapping onto the characteristics of the reference segments of the reference models (=combinations of successive characteristic vectors for a word). Non-linear mapping actions of this type can be carried out in accordance with the previously common DTW methods or in accordance with the Viterbi algorithm. These non-linear mapping actions are suited both for isolated recognition and also for contiguously spoken words (continuous speech).

With regard to this mapping, the test model is mapped to all reference models and the respective distances between the reference models and the test model are calculated according to a predetermined distance scale. Distortion paths VP with which the characteristic vectors of the test model are mapped onto the reference models are shown in the mapping matrix.

Different distance scales are known on the basis of the related art (see for example chapter 3.5.1 in Bernhard R. Kämmerer: “Sprecherunabhängigkeit und Sprechadaption”, Informatik Fachberichte 244, Springer Verlag, 1990).

With regard to the mapping, in contrast to earlier methods, for each assignment of a time window of the test utterance to a time window of the reference models the smallest distance between the test characteristic vector and the existing alternative reference characteristic vectors is formed.

In accordance with the specifications for non-linear mapping, these minimum individual distances along the distortion path are accumulated to form an overall distance for the word.
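The local minimum over the alternative characteristic vectors and its accumulation along the distortion path can be sketched as follows. The names `local_distance` and `overall_distance`, the Euclidean distance scale and the simple three-predecessor recursion are assumptions for this sketch and stand in for whichever distance scale and mapping specification is actually used.

```python
import math

def local_distance(test_vector, alternatives):
    """Smallest distance between a test vector and the alternatives of one time window."""
    return min(math.dist(test_vector, alt) for alt in alternatives)

def overall_distance(reference_model, test_model):
    """Accumulate the minimum local distances along the best distortion path."""
    n, m = len(reference_model), len(test_model)
    inf = float("inf")
    # acc[i][j]: best accumulated distance for the first i windows and first j test vectors
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(test_model[j - 1], reference_model[i - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

Here `reference_model` is assumed to be a list of time windows, each a list of alternative vectors, and `test_model` a plain list of characteristic vectors.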

Within the scope of the above it is also possible to employ known pruning methods when comparing the test models with the reference models or to limit the number of reference models to be compared with the test model through the use of voice models.

In step S5, the reference model which exhibits the smallest overall distance from the test model is selected as the recognition result. The distortion paths printed as heavy lines indicate the mapping onto the selected reference model.

The maximum distortion of the distortion paths is preferably restricted to a certain working range. This means that no more than a certain number n of characteristic vectors of the test model may be mapped to one time window of the reference model and one characteristic vector may not be mapped to more than the specific number n of test windows. n is an integer in the range 2 to 5. The consequence of this is that the distortion path runs within a corridor K in the mapping matrix (FIG. 4).
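One possible reading of this restriction can be checked on a given distortion path as sketched below. The representation of a path as a list of `(reference_window, test_vector)` index pairs and the names `longest_run` and `within_corridor` are assumptions of this sketch, as is the interpretation that the limit n applies to consecutive repetitions of the same index along the path.

```python
from itertools import groupby

def longest_run(values):
    """Length of the longest run of consecutive equal values (0 for an empty input)."""
    return max((len(list(group)) for _, group in groupby(values)), default=0)

def within_corridor(path, n):
    """True if no reference window and no test vector is reused more than n times in a row."""
    ref_run = longest_run(ref for ref, _ in path)
    test_run = longest_run(test for _, test in path)
    return ref_run <= n and test_run <= n
```

For instance, a path that maps three successive test vectors onto the same reference window would be rejected for n = 2 and accepted for n = 3.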

With regard to the method described above, each characteristic vector describes a voice signal for a time window having a predetermined duration which lies in the range 5 ms to 20 ms, and is preferably 10 ms. Instead of spectral characteristics, autocorrelation characteristics or other suitable characteristics can also be used, such as LPC characteristics (linear prediction coefficients), MFCC characteristics (mel-frequency cepstral coefficients) or CC characteristics (cepstral coefficients), for example.

With regard to the above example, each reference segment represents one phoneme or diphthong. Within the scope of the above method, however, it is possible for the reference segments to represent diphones, triphones, syllables or other suitable subunits. Similarly, in addition to a word a speech unit can also represent a phrase or similar.

The reference segments describe speech modules by characteristic vectors, whereby the characteristic vectors are stored in a typical sequence which results from the training voice data. For each characteristic vector, alternative characteristic vectors are specified. With regard to the modeling of larger speech units, the reference segments are combined in order to describe the speech units, as a result of which any words are rendered recognizable with little training effort. The recognition is based on a non-linear mapping, whereby the alternative characteristic vectors are used for determining the local correspondence. The non-linear mapping can be used both for individually spoken words and also for continuous speech.

The differences from HMM voice recognition facilities and the determination of the average time structure will be explained again in detail in the following with reference to FIGS. 1 to 5.

With regard to state-of-the-art HMM voice recognition systems having speaker-independent recognition, speech samples from a very large number of speakers are collected for training purposes. In this situation, the selection of the references for the subsequent recognition phase takes place in such a manner that

1. training utterances are spoken as specified,

2. the resulting voice signals are (frequency) analyzed in a fixed time pattern (10 ms to 20 ms) and the characteristics are stored,

3. the characteristic sequence (iterative) is subdivided timewise in accordance with the phonetic transcription, whereby generally each phoneme is additionally further subdivided by a fixed number of states (for example, 3 states: start of phoneme, phoneme core, end of phoneme),

4. a set of representatives which can be placed in a “code book” for the subsequent recognition phase is selected from the individual characteristics of all sections from all utterances which correspond to a state (for example by way of clustering methods). As an alternative to this, distributions can also be stored by way of the characteristic components.

Thereafter, three groups, for example, each having a plurality of representatives are represented in the code book for each phoneme, as shown in FIG. 5 for the phoneme “a”:

Start   Characteristic a(A,1)
        . . .
        Characteristic a(A,x)

Middle  Characteristic a(M,1)
        . . .
        Characteristic a(M,y)

End     Characteristic a(E,1)
        . . .
        Characteristic a(E,z)

It is important that there is no longer any sequence of characteristics within the states. All can occur at any position with the same probability.

In the recognition phase (Viterbi search) the characteristics of the signal to be recognized are compared with the representatives (distance calculation for example). The minimum distance is then selected within a state. In this situation, the original sequence of the characteristics within a state plays no part. The dwell time in a state is defined either by way of the fixed iteration probability when compared with the transition probability to the next state (exponential reduction in the overall probability according to the number of iterations) or is controlled by way of an average duration determined from the training data (for example Gaussian probability distribution around the duration). The coarse structure is defined only by the sequence of the states (in other words, with regard to the non-linear mapping (Viterbi), it is necessary to go initially from the “start” state, via the “middle” state to the “end” state).
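The loss of ordering within a state can be made concrete with a small sketch. Scalar characteristics, the absolute difference as distance and the name `state_distance` are simplifying assumptions; transition and dwell-time probabilities are deliberately left out.

```python
def state_distance(sequence, representatives):
    """Accumulate, for each characteristic, the distance to the closest representative."""
    return sum(min(abs(x - r) for r in representatives) for x in sequence)

representatives = [1.0, 2.0, 3.0]   # unordered set stored for one state
rising = [1.0, 2.0, 3.0]            # characteristics in their original time order
falling = list(reversed(rising))    # same characteristics in reversed order
```

Both `rising` and `falling` obtain the same accumulated distance to the state, which illustrates that the order of the characteristics within a state has no influence on the result.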

The following disadvantages result from the above:

-   Inadequate time representation. The fixed division into three states, for example, does not do justice to the extremely variable duration of the real forms of phonemes. An “a” or a glide “ai”, for example, can have an extremely long duration (perhaps 20 analysis windows = 200 ms), in which case a very large number of characteristics are mapped onto one state. This corresponds to a disproportionately coarse representation. A “p” can be very short (perhaps only 3 analysis windows), from which an excessively fine modeling results.
-   Reduced selectivity. With regard to long phonemes, the Viterbi search can find the most favorable representative for each state. This can result in the fact, for example, that a representative which actually stands for the end of the state is used for a large area of the test signal. This may result in an overall distance which is altogether too small and a loss in differentiating capability. Particularly affected here are words whose phonemes are similar and where a certain overlapping of characteristics arises as a result of errors in analysis.
-   Coarse time distortion with regard to the mapping. Since the test signal is represented “fully resolved” (in other words in the original 10 ms sequence of characteristics) but the references are represented with the few states, the mapping must also map larger sections of the test signal to one state. This does not take into consideration the fact that the speaking speed can only change within tight limits. (A common effect in this situation is the fact that, for example, dictation systems can work better with quickly spoken utterances than with those which are spoken normally or slowly.)

These disadvantages are intended to be reduced and/or eliminated by the above method.

In this situation the phonemes are described not by a fixed number of states but by a model (=a sequence of characteristics) which is obtained from the training utterances. The idea behind this is to implement a similar resolution on the reference side as on the test side. The individual steps required to achieve this are as follows:

1. training utterances are spoken as specified,

2. the resulting voice signals are (frequency) analyzed in a fixed time pattern (10 ms-20 ms) and the characteristics are stored,

3. the characteristic sequence is subdivided (iteratively) over time in accordance with the phonetic transcription, whereby models are extracted in each case for the phonemes,

4. from the models obtained from all utterances which correspond to one phoneme, an average “time structure” is calculated in respect of the time sequence of the characteristics, the number of characteristics and also the form of the characteristics,

5. a small code book containing representative characteristics is created (in order to cover the speaker-specific characteristics) for each “frame” (in other words each analysis section) of the average time structure.
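
Step 3 can be illustrated by the following minimal Python sketch. It assumes that the phoneme boundaries are already known as analysis-window indices; how they are found iteratively is not shown. The function name extract_models and the (phoneme, start, end) layout are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def extract_models(features, segmentation):
    """Collect one training model per phoneme occurrence.

    features:     list of characteristic vectors, one per analysis window
    segmentation: list of (phoneme, start, end) entries, end exclusive
                  (assumed layout, not prescribed by the method)
    """
    models = defaultdict(list)
    for phoneme, start, end in segmentation:
        # A model is simply the characteristic sequence of that stretch.
        models[phoneme].append(features[start:end])
    return dict(models)


# Six analysis windows with one-dimensional characteristics.
features = [[float(t)] for t in range(6)]
segmentation = [("t", 0, 2), ("s", 2, 5), ("t", 5, 6)]
models = extract_models(features, segmentation)
```

Each phoneme then has a list of models of differing length, which is the input to the averaging in step 4.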

FIGS. 6 and 7 are intended to visualize step 4. The models which stand for the phoneme string “ts” are spoken and segmented from different utterances. The averaging of a very short “ts” with 11 analysis windows and a long “ts” with 17 analysis windows results in an average model with 14 analysis windows and spectral characteristics which likewise exhibit an average characteristic and represent the “time structure” (characteristic sequence).

In the second example the glide “ai” has been uttered, whereby the length of the sounds here and thus also of the averaging are practically identical, but the effect of the averaging can be recognized in the characteristic sequence of the characteristics.

The examples show an averaging from n=2 training models. By analogy, this naturally also applies to any value for n.

FIG. 8 shows the methods used during averaging for the non-linear mapping of i and j and also the projection onto a resulting model i.
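
The averaging of two training models can be sketched as follows. This is an illustration under assumptions and not a reproduction of FIG. 8: the squared Euclidean distance, the symmetric step pattern of the mapping and the projection rule (picking the mapping point nearest to each average frame) are all choices made for this sketch, as are the names dtw_path and average_model.

```python
from math import inf


def dist(u, v):
    """Squared Euclidean distance between two characteristic vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dtw_path(a, b):
    """Non-linear mapping of model a onto model b as index pairs (i, j)."""
    n, m = len(a), len(b)
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(a[i], b[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = inf
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best = cost[pi][pj]
                    back[i][j] = (pi, pj)
            cost[i][j] = best + d
    # Trace the best mapping back from the end point to (0, 0).
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return path


def average_model(a, b):
    """Average two models: mean length and mean characteristics."""
    path = dtw_path(a, b)
    length = round((len(a) + len(b)) / 2)
    span = len(a) + len(b) - 2
    result = []
    for k in range(length):
        # Position of average frame k along the mapping (assumed rule).
        target = span * k / (length - 1) if length > 1 else 0
        i, j = min(path, key=lambda p: abs(p[0] + p[1] - target))
        result.append([(x + y) / 2 for x, y in zip(a[i], b[j])])
    return result


short = [[float(t)] for t in range(11)]     # 11 analysis windows
long_ = [[t * 10 / 16] for t in range(17)]  # 17 analysis windows
avg = average_model(short, long_)           # 14 analysis windows
```

For more than two training models the same mapping could be applied repeatedly; the sketch covers only the case n=2 shown in the examples.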

For step 5, the resulting average structure is now retained. All the models observed during the training are mapped in non-linear fashion onto this structure. A clustering of the characteristics mapped is then carried out for each frame of the average structure.
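
The clustering for each frame can be sketched as follows. The clustering procedure is not specified above, so a plain k-means with farthest-point initialization is assumed here. The sketch also assumes that the non-linear mapping has already been carried out, so that each frame holds a non-empty list of mapped characteristic vectors. The names cluster_frame and build_reference_segment are hypothetical.

```python
def dist(u, v):
    """Squared Euclidean distance between two characteristic vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_frame(vectors, k, iterations=10):
    """Reduce the vectors of one frame to at most k representatives."""
    # Farthest-point initialization keeps the result deterministic.
    centers = [vectors[0]]
    while len(centers) < min(k, len(vectors)):
        far = max(vectors, key=lambda v: min(dist(v, c) for c in centers))
        if min(dist(far, c) for c in centers) == 0:
            break  # no further distinct vectors in this frame
        centers.append(far)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda c: dist(v, centers[c]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def build_reference_segment(frames, k=2):
    """One small code book per frame of the average time structure."""
    return [cluster_frame(vectors, k) for vectors in frames]


# Two frames; the first one shows two clearly separate groups.
frames = [[[0.0], [0.5], [5.0], [5.5]], [[1.0], [1.0]]]
segment = build_reference_segment(frames, k=2)
```

The parameter k limits the size of the code book per frame; a frame whose mapped vectors are all identical ends up with a single representative.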

In the recognition phase, the words to be recognized are composed from the references for the phonemes in accordance with the phonetic transcription. With regard to the “traditional” HMM, it is the states which are arranged side by side, whereas with the proposed method it is the reference segments. The so-called search (the optimum mapping of the test signal onto the formed references) then takes place in the case of HMM by way of a Viterbi algorithm for example, whereas with the proposed method it takes place by way of the (more general) “dynamic time warping” approach. With these methods, the search space is defined by the permitted gradients occurring in the case of the transition from one grid point (in the matrix made up of reference and test models) to the next. In this situation, a gradient of “1” signifies a linear mapping, whereas a gradient of “0” signifies a collapse of the entire test model to one state of the reference model and an “infinite” gradient signifies a collapse of the entire reference model to one analysis frame of the test model.

As can be seen from FIGS. 9 and 10, with regard to the Viterbi algorithm it is also necessary to permit “0” gradients as a result of the differing resolution of reference and test. With the new method, however, the mapping can for example be restricted to a range (0.5 . . . 2), in other words it is assumed that in comparison with an average speaking speed the test signal can be spoken half as fast as a minimum and twice as fast as a maximum. As a result of the restricted mapping range the search is “forced” to also compare all relevant model sections and not to simply skip entire sections.
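
The recognition with a restricted mapping range can be sketched as follows. The sketch makes several assumptions: the range (0.5 . . . 2) is enforced by permitting only the grid steps (1, 1), (1, 2) and (2, 1), the local distance is the squared Euclidean distance to the best-fitting alternative of a reference frame, and frames skipped by a double step do not contribute to the distance. All names (frame_distance, build_word_reference, constrained_dtw, recognize) are hypothetical.

```python
from math import inf


def frame_distance(test_vec, alternatives):
    """Distance to the best-fitting alternative of one reference frame."""
    return min(sum((a - b) ** 2 for a, b in zip(test_vec, alt))
               for alt in alternatives)


def build_word_reference(transcription, segments):
    """Concatenate the reference segments of the phonemes of one word."""
    return [frame for phoneme in transcription for frame in segments[phoneme]]


# Permitted steps (reference frames, test frames): gradients 1, 0.5 and 2.
# Gradients of "0" and "infinite" are thereby excluded.
STEPS = ((1, 1), (1, 2), (2, 1))


def constrained_dtw(test, reference):
    """Distance between test model and reference; inf if not mappable."""
    n, m = len(reference), len(test)
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = frame_distance(test[0], reference[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min((cost[i - di][j - dj] for di, dj in STEPS
                        if i >= di and j >= dj), default=inf)
            if best < inf:
                cost[i][j] = best + frame_distance(test[j], reference[i])
    return cost[n - 1][m - 1]


def recognize(test, lexicon, segments):
    """Select the word whose reference has the smallest distance."""
    scores = {word: constrained_dtw(test, build_word_reference(tr, segments))
              for word, tr in lexicon.items()}
    return min(scores, key=scores.get), scores


# Each reference frame is a small code book of alternative vectors.
segments = {"a": [[[0.0]], [[0.0], [1.0]]], "b": [[[5.0]], [[5.0]]]}
lexicon = {"ab": ["a", "b"], "ba": ["b", "a"]}
test = [[0.0], [1.0], [5.0], [5.0]]
best, scores = recognize(test, lexicon, segments)
```

Because no step keeps the reference index or the test index fixed, a test model of a single frame cannot be mapped onto a reference of four frames in this sketch; the distance remains infinite, which corresponds to the exclusion of collapsed mappings.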

The invention has been described in detail with particular reference to preferred embodiments thereof and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the invention.

1. A method for producing reference segments describing speech modules, for a voice recognition system, comprising: phonetically segmenting a spoken training voice signal into speech modules in accordance with a predefined transcription; subdividing each speech module into a sequence of time windows; analyzing the spoken training voice signal in each time window to obtain a characteristic vector for each time window and obtain a training model from a sequence of characteristic vectors corresponding to the sequence of time windows, each speech module having a plurality of training models corresponding to a plurality of different pronunciations for the speech module; forming an average time structure for each speech module, the average time structure being formed by comparing the plurality of training modules for the speech module, the average time structure containing information regarding an average pronunciation speed and style, the average time structure having a plurality of time windows, the average time structure being formed by mapping the characteristic vectors of the different training models onto the time windows of the average time structure such that each time window of the average time structure contains a plurality of characteristic vectors, the characteristic vectors being mapped using a non-linear mapping; and saving the plurality of time windows for the average time structure as a reference segment.
2. A method for producing reference segments for a voice recognition system, comprising: phonetically segmenting a training voice signal into speech modules in accordance with a predefined transcription; analyzing the training voice signal in predetermined time windows in order to obtain at least one characteristic vector for each time window, as a result of which training models are formed which in each case contain characteristic vectors in the time sequence of the training voice signal; determining an average time structure, which is an average of change duration and time sequence characteristics, for each speech module; assigning the characteristic vectors to the average time structure by a temporally non-linear mapping to produce a reference segment; and storing the reference segment.
3. The method according to claim 2, wherein the training voice signal is segmented into speech modules to separate phonemes, diphthongs, diphones, triphones or syllables.
4. The method according to claim 2, wherein the characteristic vectors of the training models represent spectral characteristics, autocorrelation characteristics, LPC characteristics, MFCC characteristics or CC characteristics.
5. The method according to claim 2, wherein the average time sequence is obtained by performing non-linear mappings of the training models on the speech module to one another and by averaging the mappings.
6. The method according to claim 2, further comprising clustering the characteristic vectors of the time windows.
7. The method according to claim 6, wherein the number of characteristic vectors per time window are limited to a particular number.
8. The method according to claim 3, wherein the characteristic vectors of the training models represent spectral characteristics, autocorrelation characteristics, LPC characteristics, MFCC characteristics or CC characteristics.
9. The method according to claim 8, wherein the average time sequence is obtained by performing non-linear mappings of the training models on the speech module to one another and by averaging the mappings.
10. The method according to claim 9, further comprising clustering the characteristic vectors of the time windows.
11. The method according to claim 10, wherein the number of characteristic vectors per time window are limited to a particular number.
12. The method according to claim 11, wherein the number of characteristic vectors corresponds to a variance in the characteristic vectors for the training models, such that if there is a greater variance, more characteristic vectors are used.
13. A method for producing reference segments for a voice recognition system, comprising: segmenting a training voice signal into speech modules in accordance with a predefined transcription; analyzing the training voice signal in predetermined time windows in order to obtain at least one characteristic vector for each time window, as a result of which training models are formed which in each case contain characteristic vectors in the time sequence of the training voice signal; determining an average time structure, which is an average of change duration and time sequence characteristics, for each speech module; assigning the characteristic vectors to the average time structure by a temporally non-linear mapping to produce a reference segment; storing the reference segment; and clustering the characteristic vectors of the time windows, wherein the number of characteristic vectors corresponds to a variance in the characteristic vectors for the training models, such that if there is a greater variance, more characteristic vectors are used.
14. A method for modeling speech units of a spoken test model in a voice recognition system, comprising: producing reference segments describing speech modules for a voice recognition system, comprising: phonetically segmenting a spoken training voice signal into speech modules in accordance with a predefined transcription; subdividing each speech module into a sequence of time windows; analyzing the spoken training voice signal in each time window to obtain a characteristic vector for each time window and obtain a training model from a sequence of characteristic vectors corresponding to the sequence of time windows, each speech module having a plurality of training models corresponding to a plurality of different pronunciations for the speech module; forming an average time structure for each speech module, the average time structure being formed by comparing the plurality of training modules for the speech module, the average time structure containing information regarding an average pronunciation speed and style, the average time structure having a plurality of time windows, the average time structure being formed by mapping the characteristic vectors of the different training models onto the time windows of the average time structure such that each time window of the average time structure contains a plurality of characteristic vectors, the characteristic vectors being mapped using a non-linear mapping; and saving the plurality of time windows for the average time structure as a reference segment; forming a plurality of reference models, each reference model being formed by combining a plurality of reference segments, each reference model representing a speech unit; performing a non-linear comparison of the reference models with the test model and determining in each case a distance between the reference model and the test model; and selecting the reference model having the smallest distance from the test model, whereby the speech unit represented by the reference segments is assigned to the test model.
15. The method according to claim 14, wherein each reference model represents a word to be recognized.
16. The method according to claim 15, wherein each reference model is formed from a concatenation of the reference segments in accordance with the transcription.
17. The method according to claim 16, wherein the non-linear comparison is effected by a non-linear time adjustment of the test model to the reference models for the words to be recognized.
18. The method according to claim 17, wherein the non-linear time adjustment is restricted to a defined working range.
19. The method according to claim 18, wherein each reference segment has a characteristic vector, the test model has a characteristic vector, in performing the non-linear comparison, a distance is determined between the characteristic vector of the test model and each of the characteristic vectors of the reference segment, and the distance is determined to be the minimum of the distances between the characteristic vector of the test model and the characteristic vectors of the reference segments.
20. The method according to claim 19, wherein distortion is limited in the non-linear mapping.
21. The method according to claim 14, wherein the non-linear comparison is effected by a non-linear time adjustment of the test model to the reference models for the words to be recognized.
22. The method according to claim 21, wherein the non-linear time adjustment is restricted to a defined working range.
23. The method according to claim 14, wherein each reference segment has a characteristic vector, the test model has a characteristic vector, in performing the non-linear comparison, a distance is determined between the characteristic vector of the test model and each of the characteristic vectors of the reference segment, and the distance is determined to be the minimum of the distances between the characteristic vector of the test model and the characteristic vectors of the reference segments.
24. The method according to claim 14, wherein distortion is limited in the non-linear mapping.